Preferential attachment in the growth of social networks: the case of Wikipedia 

We present an analysis of the statistical properties and growth of the free on-line encyclopedia 
Wikipedia. By describing topics by vertices and hyperlinks between them as edges, we can represent 
this encyclopedia as a directed graph. The topological properties of this graph are in close analogy 
with those of the World Wide Web, despite the very different growth mechanism. In particular we 
measure a scale-invariant distribution of the in- and out- degree and we are able to reproduce these 
features by means of a simple statistical model. As a major consequence, Wikipedia growth can 
be described by local rules such as the preferential attachment mechanism, though users can act 
globally on the network. 

PACS numbers: 89.75.Hc, 89.75.Da, 89.75.Fb 



Statistical properties of social networks have become a 
major research topic in the statistical physics of scale-free 
networks [1, 2, 3]. Collaboration systems are a typical 
example of a social network, where vertices represent 
individuals. In the case of actors' collaborations, for 
instance, edges are drawn between actors playing 
together in the same movie. In the case of firm boards of 
directors |(| , the managers are connected if they sit in 
the same board. In the scientific co-authorship networks, 
an edge is drawn between scientists who co-authored 
at least one paper. Other kinds of networks, such as in- 
formation ones, are the result of human interaction: the 
World Wide Web (WWW) is a well-known example of 
such, although its peculiarities often put it outside the 
social networks category. 

In this paper, we analyze the graph of Wikipedia [11], 
an on-line virtual encyclopedia whose topology has 
attracted considerable interest in recent times. This 
system grows constantly as new entries are continuously 
added by users through the Internet. Thanks to the 
Wiki software [12], any user can introduce new entries 
and modify the entries already present. It is natural to 
represent this system as a directed graph, where the ver- 
tices correspond to entries and edges to hyperlinks, au- 
tonomously drawn between various entries by contribu- 
tors. 

The main observation is that the Wikipedia graph 
exhibits a topological bow-tie-like structure, as does 
the WWW. Moreover, the frequency distribution for 
the number of incoming (in-degree) and outgoing (out- 
degree) edges decays as a power-law, while the in-degrees 
of connected vertices are not correlated. As these findings 
suggest, edges are not drawn toward and from existing 
topics uniformly; rather, a large number of incoming and 
outgoing edges increases the probability of acquiring new 
incoming and outgoing edges respectively. In the litera- 
ture concerning scale-free networks, this phenomenon is 
called "preferential attachment" [4], and it is explained 
in detail below. 



Wikipedia is an intriguing research object from a so- 
ciologist's point of view: pages are published by a num- 
ber of independent and heterogeneous individuals in var- 
ious languages, covering topics they consider relevant 
and about which they believe themselves to be competent. Our 
dataset encompasses the whole history of the Wikipedia 
database, reporting any addition or modification to the 
encyclopedia. Therefore, the rather broad information 
contained in the Wikipedia dataset can be used to val- 
idate existing models for the development of scale-free 
networks. In particular, we found here one of the first 
large-scale confirmations of the preferential attachment, 
or "rich-get-richer", rule. This result is rather surprising, 
since preferential attachment is usually associated 
with network growth mechanisms triggered by local events: 
in the WWW, for instance, webmasters have control over 
their own web pages and outgoing hyperlinks, and cannot 
modify the rest of the network by adding edges elsewhere. 
Instead, with the "Wiki" technology a single user 
can edit an unlimited number of edges and topics within 
the Wikipedia network. 

The dataset presented here gathers Wikipedia pages 
in about 100 different languages; the largest subset at 
the time of our analysis consisted of the almost 500,000 
pages of the English version, growing at an exponential 
pace [13]. A detailed analysis of the algorithms used 
to crawl such data is presented elsewhere 0|. Here, we 
start our analysis by considering a typical taxonomy of 
regions introduced for the WWW [17]. The first region 
includes pages that are mutually reachable by traveling 
on the graph, named the strongly connected component 
(SCC); pages from which one reaches the SCC form the 
second region, the IN component, while the OUT com- 
ponent encompasses the pages reached from the SCC. A 
fourth region, named TENDRILS, gathers pages reach- 
able from the IN component and pointing neither to the 
SCC nor the OUT region. TENDRILS also includes 
those pages that point to the OUT region but do not belong 
to any of the other defined regions. Finally, TUBES 
directly connect the IN and OUT regions, and a few pages are 
totally disconnected (DISC). The result is the so-called 
bow-tie structure shown in Fig. 1. 

FIG. 1: The shape of the Wikipedia network. 
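
The taxonomy described above can be illustrated with a short Python sketch. It is not the code used to analyze the Wikipedia dataset: the function name `bow_tie` and the edge-list input are assumptions made here, TENDRILS, TUBES and DISC are lumped together as OTHER, and the component search is a simple quadratic one that would be too slow for the full graph.

```python
def bow_tie(edges):
    """Split the vertices of a directed graph into SCC, IN, OUT and OTHER.

    `edges` is an iterable of (source, target) pairs (hypothetical input format).
    """
    succ, pred, nodes = {}, {}, set()
    for u, v in edges:
        succ.setdefault(u, []).append(v)
        pred.setdefault(v, []).append(u)
        nodes.update((u, v))

    def reach(start, adj):
        # Depth-first search: every vertex reachable from `start` along `adj`.
        seen, stack = set(start), list(start)
        while stack:
            x = stack.pop()
            for y in adj.get(x, ()):
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return seen

    # Largest strongly connected component: vertices that are both reachable
    # from v and able to reach v. Simple but quadratic in the worst case.
    largest, assigned = set(), set()
    for v in nodes:
        if v in assigned:
            continue
        comp = reach([v], succ) & reach([v], pred)
        assigned |= comp
        if len(comp) > len(largest):
            largest = comp

    out = reach(largest, succ) - largest   # reached from the SCC
    inn = reach(largest, pred) - largest   # from which the SCC is reached
    other = nodes - largest - out - inn    # tendrils, tubes, disconnected
    return {"SCC": largest, "IN": inn, "OUT": out, "OTHER": other}


if __name__ == "__main__":
    toy = [(1, 2), (2, 3), (3, 1), (0, 1), (3, 4), (0, 5), (6, 7)]
    parts = bow_tie(toy)
    for name, members in parts.items():
        print(name, sorted(members))
```

Dividing the size of each set by the total number of vertices gives percentages of the kind listed in Table I.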

TABLE I: Size of the bow-tie components of the Wikipedia 
for various languages. Each entry in the table presents the 
percentage of vertices of the corresponding graph that belong 
to the indicated bow-tie component. 

As a general remark, Wikipedia shows a rather large 
interconnection; this means that most of the vertices are 
in the SCC. From almost any page it is possible to reach 
any other. This feature describes one of the few differ- 
ences between the on-line encyclopedia and the WWW: 
the content of an article can be fully understood by vis- 
iting a connected path along the network. 

The key quantities characterizing the structure of an 
oriented network are the in-degree ($k_{in}$) and out-degree 
($k_{out}$) distributions. As shown in Fig. 2, both distributions 
display an algebraic decay, of the kind 
$P(k_{in,out}) \propto k_{in,out}^{-\gamma_{in,out}}$, with 
$2 < \gamma_{in,out} < 2.2$. 

Actually, in the case of the out-degree distribution, 
the value of the exponent seems to be rather dependent 
upon the size of the system as well as the region chosen 
for the fit. Given the sharp cutoff in this distribution, the 
cumulative method of plotting could in this case result 
in a considerably larger value of the exponent. 
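
To make the dependence on the plotting method and on the fit region concrete, here is a minimal Python sketch of the cumulative method. The helper names `ccdf` and `loglog_slope` and the choice of a plain least-squares fit are assumptions for illustration, not the procedure used for the figures. For a pure power law $P(k) \sim k^{-\gamma}$ the cumulative distribution decays as $k^{-(\gamma-1)}$, so the exponent is estimated as one minus the fitted slope.

```python
import math
from collections import Counter


def ccdf(degrees):
    """Return sorted (k, fraction of vertices with degree >= k) pairs."""
    n = len(degrees)
    counts = Counter(degrees)
    remaining, points = n, []
    for k in sorted(counts):
        points.append((k, remaining / n))
        remaining -= counts[k]
    return points


def loglog_slope(points, k_min, k_max):
    """Least-squares slope of log(y) against log(k) for k_min <= k <= k_max.

    Needs at least two points with distinct k > 0 inside the fit range.
    """
    sel = [(math.log(k), math.log(y)) for k, y in points
           if k > 0 and k_min <= k <= k_max]
    mx = sum(x for x, _ in sel) / len(sel)
    my = sum(y for _, y in sel) / len(sel)
    num = sum((x - mx) * (y - my) for x, y in sel)
    den = sum((x - mx) ** 2 for x, _ in sel)
    return num / den


if __name__ == "__main__":
    sample = [1, 1, 1, 1, 2, 2, 3, 5]   # toy degree sequence
    pts = ccdf(sample)
    gamma = 1.0 - loglog_slope(pts, 1, 5)
    print(pts, gamma)
```

Changing `k_min` and `k_max` so that the fit includes or excludes the cutoff region changes the estimated exponent, which is the sensitivity noted above.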

FIG. 2: In-degree (white symbols) and out-degree (filled symbols) 
distributions for the English (circles) and Portuguese (triangles) 
Wikipedia graphs. The solid and dashed lines represent simulation 
results for the in-degree and the out-degree, respectively, with 
10 edges added to the network per time step. The dot-dashed lines 
show two power-law decays in $k_{in,out}$ (bottom line and top line) 
as a guide for the eye. 

We proceeded further by studying the dynamics of the 
network growth. The analysis was made in order to 
validate the current paradigm explaining the formation of 
scale-free networks, introduced by the Barabási-Albert 
(BA) model [4]. The latter is based on the interplay of 
two ingredients: growth and preferential attachment. In 
the BA model, new vertices are added to the graph at 
discrete time steps and a fixed number m of edges con- 
nects each new vertex to the old ones. The preferential 
attachment rule corresponds to assigning a probability 
$\Pi(k_i) \sim k_i$ that a new vertex is connected to an existing 
vertex $i$ whose degree is $k_i$. This elementary process 
generates a non-oriented network where the degree follows 
a power-law distribution. 
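
The BA growth rule just described can be sketched in a few lines of Python. The seed graph (a complete graph on m + 1 vertices) and the choice of forbidding multiple edges from the same new vertex are assumptions made here for the example; the function name `barabasi_albert` is likewise illustrative.

```python
import random
from collections import Counter


def barabasi_albert(n, m, seed=None):
    """Grow an undirected graph with n vertices by linear preferential attachment."""
    rng = random.Random(seed)
    # Seed graph (assumption): complete graph on m + 1 vertices, so that every
    # initial vertex has non-zero degree.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Each vertex appears in `ends` once per incident edge, so a uniform draw
    # from `ends` selects a vertex with probability proportional to its degree.
    ends = [v for edge in edges for v in edge]
    for new in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:            # m distinct old vertices
            chosen.add(rng.choice(ends))
        for old in chosen:
            edges.append((new, old))
            ends.extend((new, old))
    return edges


if __name__ == "__main__":
    graph = barabasi_albert(2000, 3, seed=42)
    degree = Counter(v for edge in graph for v in edge)
    histogram = Counter(degree.values())
    for k in sorted(histogram)[:10]:
        print(k, histogram[k])
```

Plotting the resulting degree histogram on logarithmic axes shows the power-law tail mentioned above.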

To observe such a mechanism in a real network, one 
builds the histogram of the degree of the vertices acquir- 
ing new connections at each time t, weighted by a factor 
$N(t)/n(k,t)$, where $N(t)$ is the number of vertices at time 
$t$ and $n(k,t)$ is the number of vertices with in-degree $k$ 
at time $t$ [19]. 
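
A minimal Python sketch of this weighted histogram, for the in-degree, is given below. It assumes the network history is available as a time-ordered stream of (source, target) edges and that a page enters the network the first time it appears in an edge; both the input format and the function name `measure_in_attachment` are assumptions of this example rather than details of the dataset.

```python
from collections import Counter, defaultdict


def measure_in_attachment(edge_stream):
    """Weighted histogram proportional to Pi(k) for incoming edges.

    Each new edge adds N(t) / n(k, t) to the bin of the in-degree k that the
    target vertex had just before receiving the edge.
    """
    in_deg = {}                 # current in-degree of every known vertex
    n_with = Counter()          # n(k, t): number of vertices with in-degree k
    hist = defaultdict(float)
    for src, dst in edge_stream:
        for v in (src, dst):
            if v not in in_deg:         # first appearance: register the vertex
                in_deg[v] = 0
                n_with[0] += 1
        k = in_deg[dst]
        hist[k] += len(in_deg) / n_with[k]   # N(t) / n(k, t)
        n_with[k] -= 1
        n_with[k + 1] += 1
        in_deg[dst] = k + 1
    return dict(hist)


if __name__ == "__main__":
    print(measure_in_attachment([(0, 1), (2, 1), (0, 2)]))
```

The same procedure applied to the source vertex and its out-degree gives the corresponding estimate for outgoing edges.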

Since the Wikipedia network is oriented, the preferen- 
tial attachment must be verified in both directions. In 
particular, we have observed how the probability of ac- 
quiring a new incoming (outgoing) edge depends on the 
present in- (out-) degree of a vertex. The result for the 
main Wikipedia network (the English one) is reported 
in Fig. 3. For a linear preferential attachment, as supposed 
by the BA model, both plots should be linear over 
the entire range of degrees; here we recover this behavior 
only partly. This is not surprising, since several measurements 
reported in the literature display strong deviations 
from a linear behavior [20] for large values of the degree, 
even in networks with an inherent preferential attachment 
[19]. It is also possible that for certain datasets 
(i.e. the English one), the slope of the growth of $\Pi$ is slightly 
less than 1. Nevertheless, it is worth mentioning that 
the preferential attachment in Wikipedia has a somewhat 
different nature. Here, most of the time, edges are 
added between existing vertices, differently from the BA 
model. For instance, in the English version of Wikipedia 
a largely dominant fraction 0.883 of new edges is created 
between two existing pages, while a smaller fraction of 
edges points to or leaves a newly added vertex (0.026 and 
0.091 respectively). 

FIG. 3: The preferential attachment for the in-degree and the 
out-degree in the English and Portuguese Wikipedia network. 
The solid line represents the linear preferential attachment 
hypothesis $\Pi \sim k_{in,out}$. 

To draw a more complete picture of the Wikipedia net- 
work, we have also measured the correlations between the 
in- and out-degrees of connected pages. The relevance of 
this quantity is emphasized by several examples of com- 
plex networks shown to be fully characterized by their 
degree distribution and degree-degree correlations [21]. 
A suitable measure for such correlations is the average 
degree $K_{nn}(k)$ of vertices connected to vertices with 
degree $k$ (for simplicity, here we refer to a non-oriented 
network to explain the notation). These quantities are 
particularly interesting when studying social networks. 
Like other social networks, collaborative networks studied 
so far are characterized by assortative mixing, i.e. edges 
preferably connect vertices with similar degrees [8]. This 
picture would be reflected in a $K_{nn}(k)$ growing with respect 
to $k$. If $K_{nn}(k)$ grows (decays) with $k$, vertices with 
similar degrees are likely (unlikely) to be connected. This 
appears to be a clear-cut method to establish whether a 
complex network belongs to the realm of social networks, 
if other considerations turn out to be ambiguous [22]. 

In the case of an oriented network, such as Wikipedia, 
one has many options when performing such an assessment, 
since one could measure the correlations between the in- 
or the out-degrees of neighbor vertices, along incoming 
or outgoing edges. We chose to study the average in-degree 
$K_{nn}^{(in)}(k_{in})$ of upstream neighbors, i.e. of the vertices 
pointing to vertices with in-degree $k_{in}$. By focusing on the 
in-degree and on the incoming edges, we expect to extract 
information about the collective behavior of Wikipedia 
contributors and filter out their individual peculiarities: 
the latter have a strong impact on the out-degree of a 
vertex and on the choice of its outgoing edges, since 
contributors often focus on a single Wikipedia topic [13]. 

Our analysis shows a substantial lack of correlation 
between the in-degree of a vertex and the average in-degree 
of its upstream neighboring vertices. So, as reported 
in Fig. 4, incoming edges carry no information 
about the in-degrees of the connected vertices, since 
$K_{nn}^{(in)}(k_{in})$ displays no clear increasing or decreasing 
behavior when plotted against $k_{in}$. 

FIG. 4: The average neighbors' in-degree, computed along 
incoming edges, as a function of the in-degree for the English 
(circles) and Portuguese (triangles) Wikipedia, compared to 
the simulations of the model for N = 20000, M = 10, $R_1$ = 
0.026 and $R_2$ = 0.091 (dashed line) and a realization of the 
model where the first 0.5% of the vertices have been removed 
to reduce the initial condition impact (thick solid line). 
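
The average in-degree of upstream neighbors discussed above can be computed from an edge list as in the following Python sketch. The function name `knn_in` and the plain (source, target) input are assumptions of this example; vertices with no incoming edges have no upstream neighbors and are therefore left out.

```python
from collections import Counter, defaultdict


def knn_in(edges):
    """Average in-degree of upstream neighbors, binned by the in-degree k_in."""
    edges = list(edges)
    in_deg = Counter(dst for _, dst in edges)
    # For each vertex, sum the in-degrees of the vertices pointing to it.
    upstream_sum = defaultdict(float)
    for src, dst in edges:
        upstream_sum[dst] += in_deg[src]   # Counter returns 0 for unseen keys
    # Average per vertex, then average over all vertices sharing the same k_in.
    per_k = defaultdict(list)
    for v, total in upstream_sum.items():
        per_k[in_deg[v]].append(total / in_deg[v])
    return {k: sum(vals) / len(vals) for k, vals in sorted(per_k.items())}


if __name__ == "__main__":
    toy = [(0, 1), (2, 1), (1, 2), (0, 2), (3, 0)]
    print(knn_in(toy))
```

A result that neither grows nor decays with the key corresponds to the absence of correlation reported in the text.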



The above quantities, including the power law distri- 
bution of the degrees and the absence of degree-degree 
correlations, can be modeled by simple applications of 
the preferential attachment principle. Let us consider 
the following evolution rule, similarly to other models of 
rewiring already considered ^4|, for a growing directed 
network such as Wikipedia: at each time step, a vertex 
is added to the network, and is connected to the existing 
vertices by M oriented edges; the direction of each edge is 
drawn at random: with probability $R_1$ the edge leaves the 
new vertex, pointing to an existing one chosen with probability 
proportional to its in-degree; with probability $R_2$, 
the edge points to the new vertex, and the source vertex 
is chosen with probability proportional to its out-degree. 
Finally, with probability $R_3 = 1 - R_1 - R_2$ the edge 
is added between existing vertices: the source vertex is 
chosen with probability proportional to the out-degree, 
while the destination vertex is chosen with probability 
proportional to the in-degree. 
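
The evolution rule above can be turned into a small simulation. The following Python sketch is one possible reading of it, not the authors' code: the two-vertex seed graph, the decision to allow self-loops and repeated edges, and the function name `simulate_directed_pa` are assumptions introduced here.

```python
import random
from collections import Counter


def simulate_directed_pa(n_vertices, m_edges, r1, r2, seed=None):
    """Grow a directed network with the three-way rule (R1, R2, R3 = 1 - R1 - R2).

    Returns two parallel lists: sources[i] -> targets[i] is the i-th edge.
    """
    rng = random.Random(seed)
    # Seed graph (assumption): two vertices linked in both directions.
    sources, targets = [0, 1], [1, 0]
    for new in range(2, n_vertices):
        # Only edges present before this step define the "existing" vertices.
        n_old = len(sources)
        for _ in range(m_edges):
            u = rng.random()
            # A uniform draw from `targets` (`sources`) picks a vertex with
            # probability proportional to its in-degree (out-degree).
            if u < r1:                      # new vertex -> existing vertex
                src, dst = new, targets[rng.randrange(n_old)]
            elif u < r1 + r2:               # existing vertex -> new vertex
                src, dst = sources[rng.randrange(n_old)], new
            else:                           # existing -> existing
                src = sources[rng.randrange(n_old)]
                dst = targets[rng.randrange(n_old)]
            sources.append(src)
            targets.append(dst)
    return sources, targets


if __name__ == "__main__":
    src, dst = simulate_directed_pa(20000, 10, 0.026, 0.091, seed=0)
    in_hist = Counter(Counter(dst).values())
    for k in sorted(in_hist)[:10]:
        print(k, in_hist[k])
```

With strictly proportional selection, a vertex that receives no edge at birth stays isolated; whether and how such vertices are treated in the original simulations is not stated in the text.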

By solving the rate equations for $k_{in}$ and $k_{out}$ by standard 
arguments p|, we can show that this mechanism 
generates power-law distributions of both the in-degree 
and the out-degree: 

$$P(k_{in}) \sim k_{in}^{-1-\frac{1}{1-R_2}}, \qquad P(k_{out}) \sim k_{out}^{-1-\frac{1}{1-R_1}} \qquad (1)$$

which can be easily verified by numerical simulation. 
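
The rate-equation argument can be sketched as follows. This is a continuum approximation that neglects the seed graph and assumes each vertex is born with a non-zero degree; the birth time $t_0$ is notation introduced here, not taken from the text.

```latex
% Every edge adds one unit of in-degree and one unit of out-degree, so after
% t steps the total in-degree and the total out-degree both equal M t.
% A fraction R_1 + R_3 = 1 - R_2 of the M edges added per step chooses its
% target proportionally to in-degree; a fraction R_2 + R_3 = 1 - R_1 chooses
% its source proportionally to out-degree.
\begin{align}
\frac{\partial k_{in}}{\partial t} &= M (1 - R_2) \frac{k_{in}}{M t}
  \;\Rightarrow\; k_{in}(t) \propto \left(\frac{t}{t_0}\right)^{1 - R_2}, \\
\frac{\partial k_{out}}{\partial t} &= M (1 - R_1) \frac{k_{out}}{M t}
  \;\Rightarrow\; k_{out}(t) \propto \left(\frac{t}{t_0}\right)^{1 - R_1}.
\end{align}
% Birth times t_0 are uniform in [0, t], so P(k) = (1/t) |dt_0 / dk|, and the
% resulting power-law exponents are
\begin{equation}
\gamma_{in} = 1 + \frac{1}{1 - R_2}, \qquad
\gamma_{out} = 1 + \frac{1}{1 - R_1}.
\end{equation}
```

With the values measured for the English Wikipedia, $R_1 = 0.026$ and $R_2 = 0.091$, these expressions give $\gamma_{in} \approx 2.10$ and $\gamma_{out} \approx 2.03$.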

By adopting the values empirically found in the English 
Wikipedia, $R_1 = 0.026$, $R_2 = 0.091$ and $R_3 = 0.883$, 
one recovers the same power-law degree distributions of 
the real network, as shown in Fig. 2. 

The degree-degree correlations $K_{nn}^{(in)}(k_{in})$ can be 
computed analytically by the same lines of reasoning 
described in references H3|, and for $1 \ll t \ll N$ we 
have 

$K_{nn}^{(in)}(k_{in}) \sim$ (2) 

for $R_3 \neq 0$, the proportionality coefficient depending only 
on the initial condition of the network, and 

$K_{nn}^{(in)}(k_{in}) \simeq M R_1 R_2 \ln N$ (3) 

for $R_3 = 0$, where $N$ is the network size. Both equations 
are independent of $k_{in}$, as confirmed by the simulation 
reported in Fig. 4 for the same values of $R_1$, $R_2$ and $R_3$. 

Therefore, the theoretical degree-degree correlation re- 
produces qualitatively the observed behavior; to obtain 
a more accurate quantitative agreement with data, it is 
sufficient to tune the initial conditions appropriately. As 
shown in Fig. 4, this can be done by neglecting a small 
fraction of initial vertices in the network model. 

In conclusion, the bow-tie structure already observed 
in the World Wide Web, and the algebraic decay of the 
in-degree and out-degree distribution are observed in the 
Wikipedia datasets surveyed here. At a deeper level, the 
structure of the degree-degree correlation also resembles 
that of a network developed by a simple preferential 
attachment rule. This has been verified by comparing the 
Wikipedia dataset to models displaying no correlation 
between the neighbors' degrees. 

Thus, the empirical and theoretical evidence shows 
that traditional models introduced to explain nontrivial 
features of complex networks by simple algorithms 
remain qualitatively valid for Wikipedia, whose techno- 
logical framework would allow a wider variety of evolu- 
tionary patterns. This reflects on the role played by the 
preferential attachment in generating complex networks: 
such a mechanism is traditionally believed to hold when 
the dissemination of information throughout a social net- 
work is not efficient and a "bounded rationality" hypoth- 
esis |3 HH is assumed. In the WWW, for example, the 
preferential attachment is the result of the difficulty for a 
webmaster to identify optimal sources of information to 
refer to, favoring the herding behavior which generates 
the "rich-get-richer" rule. One would expect the coordi- 
nation of the collaborative effort to be more effective in 
the Wikipedia environment since any authoritative agent 
can use his expertise to tune the linkage from and toward 
any page in order to optimize information mining. Nevertheless, 
empirical evidence shows that the statistical 
properties of Wikipedia do not differ substantially from 
those of the WWW. This suggests two possible scenarios: 
preferential attachment may be the consequence of the 
intrinsic organization of the underlying knowledge; alter- 
natively, the preferential attachment mechanism emerges 
because the Wiki technical capabilities are not fully exploited 
by Wikipedia contributors: if this is the case, 
their focus on each specific subject leads them to put much more 
effort into building a single Wiki entry, with little attention 
toward the global efficiency of the organization of information 
across the whole encyclopedia. 

The authors acknowledge support from the European Project DELIS. 